Purpose of this file is to explore UK ethnic census data.
It includes data extraction, data pre-processing and statistical analysis
Metadata :
Two data source was used to explore about ethnicity.
The UK census data : The UK census is a decennial survey conducted by the Office for National Statistics (ONS) to gather information about the population of the United Kingdom. 2021 census data is collected based on various categories such as demography and migration, UK armed forces veterans, ethnicity, identity, language and religion, work, travel, housing, health and education. This coding page uses ethnicity years census data was used namely 2001, 2011 and 2021.
The UK census lookup data: This data was used to find the relevant geographical unit from the UK census data. They specifically provide attributes such as LSOA, MSOA, OA and LAD geographical code and their geographical names. The lookup data is mainly associated with geographical information systems (GIS) to analyze and visualize geospatial data. The UK census lookup data is also released along with the UK census data every ten years. Using UK census data’s LSOA codes the geographical location of the specific code is extracted from the lookup data for further analysis of segregation and applying machine learning algorithm.
Link to datasource: https://www.ons.gov.uk/datasets/TS021/editions/2021/versions/2/filter-outputs/84b48c92-aced-4b75-a761-4213b8c6bb94#get-data
# Import required packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.offline.init_notebook_mode()
from plotly.offline import iplot
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
# Load 2021 UK census and LAD dataset
df_2021 = pd.read_csv("Dataset/Ethnic_census_data/2021_data/TS021-2021-2-filtered-2023-03-19T11-48-56Z.csv")
df_lad_2021 = pd.read_csv("Dataset/Ethnic_census_data/2021_data/OA21_LSOA21_MSOA21_LAD22_EW_LU.csv", encoding='latin-1')
#Check the shape of the dataframe
print(df_2021.shape)
print(df_lad_2021.shape)
(3777600, 5) (188880, 8)
#Visualize the UK 2021 census data
df_2021.head(2)
| Output Areas Code | Output Areas | Ethnic group (20 categories) Code | Ethnic group (20 categories) | Observation | |
|---|---|---|---|---|---|
| 0 | E00000001 | E00000001 | -8 | Does not apply | 0 |
| 1 | E00000001 | E00000001 | 1 | Asian, Asian British or Asian Welsh: Bangladeshi | 0 |
#Visualizing the UK 2021 LAD data
df_lad_2021.head(2)
| oa21cd | lsoa21cd | lsoa21nm | msoa21cd | msoa21nm | lad22cd | lad22nm | lad22nmw | |
|---|---|---|---|---|---|---|---|---|
| 0 | E00000001 | E01000001 | City of London 001A | E02000001 | City of London 001 | E09000001 | City of London | NaN |
| 1 | E00000003 | E01000001 | City of London 001A | E02000001 | City of London 001 | E09000001 | City of London | NaN |
# Renaming column names in data
df_2021 = df_2021.rename(columns={'Ethnic group (20 categories) Code': 'Ethnic_group_code', 'Ethnic group (20 categories)' : 'Ethnic_group'})
# Remove special character in the OA data columns
df_2021.columns = df_2021.columns.str.replace(' ', '_')
# counting unique values in 2021 census data
n = len(pd.unique(df_lad_2021["oa21cd"]))
print("No.of.unique values :",n)
No.of.unique values : 188880
#Transposing the column Ethnic group (20 categories) in census data
df_2021 = df_2021.pivot(index='Output_Areas',columns='Ethnic_group', values='Observation')
df_2021.head(1)
| Ethnic_group | Asian, Asian British or Asian Welsh: Bangladeshi | Asian, Asian British or Asian Welsh: Chinese | Asian, Asian British or Asian Welsh: Indian | Asian, Asian British or Asian Welsh: Other Asian | Asian, Asian British or Asian Welsh: Pakistani | Black, Black British, Black Welsh, Caribbean or African: African | Black, Black British, Black Welsh, Caribbean or African: Caribbean | Black, Black British, Black Welsh, Caribbean or African: Other Black | Does not apply | Mixed or Multiple ethnic groups: Other Mixed or Multiple ethnic groups | Mixed or Multiple ethnic groups: White and Asian | Mixed or Multiple ethnic groups: White and Black African | Mixed or Multiple ethnic groups: White and Black Caribbean | Other ethnic group: Any other ethnic group | Other ethnic group: Arab | White: English, Welsh, Scottish, Northern Irish or British | White: Gypsy or Irish Traveller | White: Irish | White: Other White | White: Roma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Output_Areas | ||||||||||||||||||||
| E00000001 | 0 | 4 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 1 | 5 | 0 | 0 | 7 | 0 | 112 | 0 | 0 | 39 | 4 |
#Reset index for OA dataframe
df_2021 = df_2021.reset_index()
Combining 21 ethnic groups to 4
In 2021 census data there are 21 ethnic groups. Roma was one of new ethnic group added this year. To reduce the number of ethnicity, a person can be identified through one of the following four high-level ethnic groups as per www.ons.gov.uk:
# Checking data type for all the variables in OA data
df_2021.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 188880 entries, 0 to 188879 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Output_Areas 188880 non-null object 1 Asian, Asian British or Asian Welsh: Bangladeshi 188880 non-null int64 2 Asian, Asian British or Asian Welsh: Chinese 188880 non-null int64 3 Asian, Asian British or Asian Welsh: Indian 188880 non-null int64 4 Asian, Asian British or Asian Welsh: Other Asian 188880 non-null int64 5 Asian, Asian British or Asian Welsh: Pakistani 188880 non-null int64 6 Black, Black British, Black Welsh, Caribbean or African: African 188880 non-null int64 7 Black, Black British, Black Welsh, Caribbean or African: Caribbean 188880 non-null int64 8 Black, Black British, Black Welsh, Caribbean or African: Other Black 188880 non-null int64 9 Does not apply 188880 non-null int64 10 Mixed or Multiple ethnic groups: Other Mixed or Multiple ethnic groups 188880 non-null int64 11 Mixed or Multiple ethnic groups: White and Asian 188880 non-null int64 12 Mixed or Multiple ethnic groups: White and Black African 188880 non-null int64 13 Mixed or Multiple ethnic groups: White and Black Caribbean 188880 non-null int64 14 Other ethnic group: Any other ethnic group 188880 non-null int64 15 Other ethnic group: Arab 188880 non-null int64 16 White: English, Welsh, Scottish, Northern Irish or British 188880 non-null int64 17 White: Gypsy or Irish Traveller 188880 non-null int64 18 White: Irish 188880 non-null int64 19 White: Other White 188880 non-null int64 20 White: Roma 188880 non-null int64 dtypes: int64(20), object(1) memory usage: 30.3+ MB
Data type is checked to see if all ethnic group fields is an integer for further computing.
# Computing 4 ethnic groups : Select required columns to sum using DataFrame to create new variables with 4 only ethinic group for further analysis
df_2021['Asian']=df_2021.iloc[:,1:6].sum(axis=1)
df_2021['Black']=df_2021.iloc[:,6:9].sum(axis=1)
df_2021['White']=df_2021.iloc[:,16:21].sum(axis=1)
df_2021['Others']=df_2021.iloc[:,9:16].sum(axis=1)
# Dropping other features and including only required features in OA dataset
df_2021.drop(df_2021.iloc[:,1:-4], inplace=True, axis=1)
# Checking the null values in Census 2021 data
df_2021.isnull().sum()
Ethnic_group Output_Areas 0 Asian 0 Black 0 White 0 Others 0 dtype: int64
No null values were identifed.
#Statistical analysis
df_2021.describe()
| Ethnic_group | Asian | Black | White | Others |
|---|---|---|---|---|
| count | 188880.000000 | 188880.000000 | 188880.000000 | 188880.000000 |
| mean | 29.203155 | 12.754024 | 257.833095 | 15.745775 |
| std | 57.525246 | 26.608705 | 91.101243 | 18.159850 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.000000 | 0.000000 | 209.000000 | 4.000000 |
| 50% | 9.000000 | 3.000000 | 261.000000 | 10.000000 |
| 75% | 28.000000 | 12.000000 | 304.000000 | 21.000000 |
| max | 986.000000 | 902.000000 | 2978.000000 | 435.000000 |
# Renaming LDA columns for joining the data with census data
df_lad_2021.rename(columns={"oa21cd": "Output_Areas", "lad22nm" : "LAD", "lsoa21cd":"LSOA"}, inplace=True)
#Dropping the other columns
df_lad_to_join = df_lad_2021[['Output_Areas', 'LAD', "LSOA"]].copy()
df_lad_to_join.head(2)
| Output_Areas | LAD | LSOA | |
|---|---|---|---|
| 0 | E00000001 | City of London | E01000001 |
| 1 | E00000003 | City of London | E01000001 |
# Join OA census 2021 and LAD dataframe
join_data = pd.merge(left = df_2021, right = df_lad_to_join, how = 'left', on = 'Output_Areas' )
join_data.head(1)
| Output_Areas | Asian | Black | White | Others | LAD | LSOA | |
|---|---|---|---|---|---|---|---|
| 0 | E00000001 | 5 | 3 | 155 | 13 | City of London | E01000001 |
# Re-arranging the order of the column in joined dataset
join_data = join_data.reindex(['LSOA','LAD','Output_Areas','Asian', 'Black', 'White','Others'], axis=1)
join_data.head(2)
| LSOA | LAD | Output_Areas | Asian | Black | White | Others | |
|---|---|---|---|---|---|---|---|
| 0 | E01000001 | City of London | E00000001 | 5 | 3 | 155 | 13 |
| 1 | E01000001 | City of London | E00000003 | 24 | 5 | 209 | 17 |
# counting unique values in join_data
n = len(pd.unique(join_data["LSOA"]))
print("No.of.unique values :",n)
# counting unique values in join_data
n = len(pd.unique(join_data["LAD"]))
print("No.of.unique values :",n)
No.of.unique values : 35672 No.of.unique values : 331
#Saving joined 2021 census dataset:
join_data.to_csv("Dataset/Processed_Data/processed_ethnic_data_2021.csv")
#Loading the joined census dataset
df_census_2021 = pd.read_csv("Dataset/Processed_Data/processed_ethnic_data_2021.csv", index_col=0)
# Use groupby() to compute the sum
df2 =df_census_2021.groupby(['LAD']).sum()
df2 = df2.reset_index()
df2.head(2)
| LAD | Asian | Black | White | Others | |
|---|---|---|---|---|---|
| 0 | Adur | 1388 | 536 | 60284 | 2329 |
| 1 | Allerdale | 541 | 85 | 94671 | 853 |
#Description of Datasets
df2.describe(include='all')
| LAD | Asian | Black | White | Others | |
|---|---|---|---|---|---|
| count | 331 | 331.000000 | 331.000000 | 331.000000 | 331.000000 |
| unique | 331 | NaN | NaN | NaN | NaN |
| top | Adur | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN |
| mean | NaN | 16664.326284 | 7277.885196 | 147128.444109 | 8985.081571 |
| std | NaN | 32901.122980 | 15359.065822 | 87905.046701 | 12421.411368 |
| min | NaN | 6.000000 | 4.000000 | 2004.000000 | 40.000000 |
| 25% | NaN | 1874.000000 | 583.500000 | 94197.500000 | 2256.500000 |
| 50% | NaN | 4928.000000 | 1397.000000 | 122759.000000 | 4116.000000 |
| 75% | NaN | 14998.000000 | 5225.500000 | 165033.000000 | 8891.000000 |
| max | NaN | 355386.000000 | 125741.000000 | 641847.000000 | 107194.000000 |
dfplot = df2.sort_values(by="Asian",ascending=False).head(20)
#Using a horizontal bar chat : Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Asian')
#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='White')
#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation percentage by Ethnic Groups - Asian and White')
plt.xlabel('Poplulation')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)
#Displaying the plot
plt.show()
dfplot = df2.sort_values(by="Black",ascending=False).head(20)
#Using a horizontal bar chat : Ploting the popluation of Others
plt.barh(dfplot.LAD, dfplot.Asian, label='Black')
#Ploting the popluation of White
plt.barh(dfplot.LAD, dfplot.Black, label='Others')
#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation percentage by Ethnic Groups - Black and Others')
plt.xlabel('Poplulation')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)
#Displaying the plot
plt.show()
#Total popluation in the UK - creatin this for visualization purpose
Total_UK_2021_asian = df2['Asian'].sum()
Total_UK_2021_black = df2['Black'].sum()
Total_UK_2021_white = df2['White'].sum()
Total_UK_2021_others = df2['Others'].sum()
# Selecting only Bradford from the joined dataset
data_bradford = df_census_2021[(df_census_2021.LAD == "Bradford")]
data_bradford.head(2)
| LSOA | LAD | Output_Areas | Asian | Black | White | Others | |
|---|---|---|---|---|---|---|---|
| 50292 | E01010573 | Bradford | E00053353 | 10 | 0 | 422 | 3 |
| 50293 | E01010573 | Bradford | E00053354 | 3 | 1 | 317 | 6 |
#checking the size of census data
data_bradford.shape
(1575, 7)
# counting unique values in join_data
n = len(pd.unique(data_bradford["LSOA"]))
print("No.of.unique values :",n)
No.of.unique values : 312
#statistical analysis of census data
data_bradford.describe()
| Asian | Black | White | Others | |
|---|---|---|---|---|
| count | 1575.000000 | 1575.000000 | 1575.000000 | 1575.000000 |
| mean | 111.524444 | 6.987302 | 212.053968 | 16.365079 |
| std | 142.526573 | 10.622246 | 108.529429 | 13.891371 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 8.000000 | 0.000000 | 126.000000 | 6.000000 |
| 50% | 35.000000 | 3.000000 | 232.000000 | 12.000000 |
| 75% | 185.000000 | 9.000000 | 283.000000 | 23.000000 |
| max | 586.000000 | 102.000000 | 582.000000 | 83.000000 |
# creatin this for visualization purpose
Total_Bradford_2021_asian = data_bradford['Asian'].sum()
Total_Bradford_2021_black = data_bradford['Black'].sum()
Total_Bradford_2021_white = data_bradford['White'].sum()
Total_Bradford_2021_others = data_bradford['Others'].sum()
#Visualizing the range and distribution of numerical Variables
#Distribution of people in Bradford
fig, ax = plt.subplots(4, 2, figsize = (8, 6))
sns.boxplot(x= data_bradford["Asian"], ax = ax[0,0])
sns.distplot(data_bradford['Asian'], ax = ax[0,1])
sns.boxplot(x= data_bradford["Black"], ax = ax[1,0])
sns.distplot(data_bradford['Black'], ax = ax[1,1])
sns.boxplot(x= data_bradford["White"], ax = ax[2,0])
sns.distplot(data_bradford['White'], ax = ax[2,1])
sns.boxplot(x= data_bradford["Others"], ax = ax[3,0])
sns.distplot(data_bradford['Others'], ax = ax[3,1])
plt.tight_layout()
Metadata :
census - https://www.nomisweb.co.uk/query/construct/submit.asp?menuopt=201&subcomp=
#Load the 2011 OA and LAD dataset
df_2011 = pd.read_csv("Dataset/Ethnic_census_data/2011_data/2011_1850896343189182.csv", sep=';', header=None, encoding='latin-1')
df_lad_2011 = pd.read_csv("Dataset/Ethnic_census_data/2011_data/Output_Area_to_Lower_layer_Super_Output_Area_to_Middle_layer_Super_Output_Area_to_Local_Authority_District_(December_2011)_Lookup_in_England_and_Wales.csv", encoding='latin-1')
#Using drop()function to delete first 5 rows in OA dataset
df_2011.drop(index=df_2011.index[:6], axis=0, inplace=True)
#Renaming the column name in OA dataset
df_2011.rename(columns={df_2011.columns[0]: 'Ethnic_Group'},inplace=True)
print(df_2011.columns)
Index(['Ethnic_Group'], dtype='object')
#Splitting the first column into multiple columns in OA data
df_2011 = df_2011.Ethnic_Group.str.split(',',expand=True)
# Set First Row as header in OA data
df_2011.columns = df_2011.iloc[0]
df_2011 = df_2011[1:]
# checking the column names in 2011 census OA data
df_2011.columns
Index(['2011 output area', '"All usual residents"', '"White"',
'"White: English/Welsh/Scottish/Northern Irish/British"',
'"White: Irish"', '"White: Gypsy or Irish Traveller"',
'"White: Other White"', '"Mixed/multiple ethnic groups"',
'"Mixed/multiple ethnic groups: White and Black Caribbean"',
'"Mixed/multiple ethnic groups: White and Black African"',
'"Mixed/multiple ethnic groups: White and Asian"',
'"Mixed/multiple ethnic groups: Other Mixed"', '"Asian/Asian British"',
'"Asian/Asian British: Indian"', '"Asian/Asian British: Pakistani"',
'"Asian/Asian British: Bangladeshi"', '"Asian/Asian British: Chinese"',
'"Asian/Asian British: Other Asian"',
'"Black/African/Caribbean/Black British"',
'"Black/African/Caribbean/Black British: African"',
'"Black/African/Caribbean/Black British: Caribbean"',
'"Black/African/Caribbean/Black British: Other Black"',
'"Other ethnic group"', '"Other ethnic group: Arab"',
'"Other ethnic group: Any other ethnic group"'],
dtype='object', name=6)
#Remove special character in headers
df_2011.columns=df_2011.columns.str.replace('["]','')
#Renaming the column names
df_2011.rename(columns={'White' : 'All_white',
'2011 output area': 'Output_Areas'}, inplace=True)
#changing the datatype for all the columns to numeric for further computation in OA data
df_2011['All_white'] = pd.to_numeric(df_2011['All_white'], errors='coerce')
df_2011['White: English/Welsh/Scottish/Northern Irish/British'] = pd.to_numeric(df_2011['White: English/Welsh/Scottish/Northern Irish/British'], errors='coerce')
df_2011['White: Irish'] = pd.to_numeric(df_2011['White: Irish'], errors='coerce')
df_2011['White: Gypsy or Irish Traveller'] = pd.to_numeric(df_2011['White: Gypsy or Irish Traveller'], errors='coerce')
df_2011['White: Other White'] = pd.to_numeric(df_2011['White: Other White'], errors='coerce')
df_2011['Mixed/multiple ethnic groups'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: White and Black Caribbean'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: White and Black Caribbean'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: White and Black African'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: White and Black African'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: White and Asian'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: White and Asian'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: Other Mixed'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: Other Mixed'], errors='coerce')
df_2011['Asian/Asian British'] = pd.to_numeric(df_2011['Asian/Asian British'], errors='coerce')
df_2011['Asian/Asian British: Indian'] = pd.to_numeric(df_2011['Asian/Asian British: Indian'], errors='coerce')
df_2011['Asian/Asian British: Pakistani'] = pd.to_numeric(df_2011['Asian/Asian British: Pakistani'], errors='coerce')
df_2011['Asian/Asian British: Bangladeshi'] = pd.to_numeric(df_2011['Asian/Asian British: Bangladeshi'], errors='coerce')
df_2011['Asian/Asian British: Chinese'] = pd.to_numeric(df_2011['Asian/Asian British: Chinese'], errors='coerce')
df_2011['Asian/Asian British: Other Asian'] = pd.to_numeric(df_2011['Asian/Asian British: Other Asian'], errors='coerce')
df_2011['Black/African/Caribbean/Black British'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British'], errors='coerce')
df_2011['Black/African/Caribbean/Black British: African'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British: African'], errors='coerce')
df_2011['Black/African/Caribbean/Black British: Caribbean'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British: Caribbean'], errors='coerce')
df_2011['Black/African/Caribbean/Black British: Other Black'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British: Other Black'], errors='coerce')
df_2011['Other ethnic group'] = pd.to_numeric(df_2011['Other ethnic group'], errors='coerce')
df_2011['Other ethnic group: Arab'] = pd.to_numeric(df_2011['Other ethnic group: Arab'], errors='coerce')
df_2011['Other ethnic group: Any other ethnic group'] = pd.to_numeric(df_2011['Other ethnic group: Any other ethnic group'], errors='coerce')
# Select required columns to sum using DataFrame.iloc[]
df_2011['Asian']=df_2011.iloc[:,12:18].sum(axis=1)
df_2011['Black']=df_2011.iloc[:,18:22].sum(axis=1)
df_2011['White']=df_2011.iloc[:,2:7].sum(axis=1)
df_2011['Others']=df_2011.iloc[:,[7,8,9,10,11,22,23,24]].sum(axis=1)
# Dropping other features and including only required features
df_2011.drop(df_2011.iloc[:,1:-4], inplace=True, axis=1)
df_2011.head(1)
| 6 | Output_Areas | Asian | Black | White | Others |
|---|---|---|---|---|---|
| 7 | E00000001 | 12.0 | 0.0 | 350.0 | 26.0 |
df_lad_2011.head(1)
| OA11CD | LSOA11CD | LSOA11NM | MSOA11CD | MSOA11NM | LAD11CD | LAD11NM | LAD11NMW | ObjectId | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | E00000001 | E01000001 | City of London 001A | E02000001 | City of London 001 | E09000001 | City of London | NaN | 1 |
#Renaming LDA columns
df_lad_2011.rename(columns={"OA11CD": "Output_Areas", "LAD11NM" : "LAD", "LSOA11CD":"LSOA"}, inplace=True)
#Dropping the other columns
df_lad_to_join_2011 = df_lad_2011[['Output_Areas', 'LAD', "LSOA"]].copy() # To avoid the case where changing df1 also changes df
df_lad_to_join_2011.head(1)
| Output_Areas | LAD | LSOA | |
|---|---|---|---|
| 0 | E00000001 | City of London | E01000001 |
# Join 2011 census and LAD dataframe
join_data_2011 = pd.merge(left = df_2011, right = df_lad_to_join_2011, how = 'left', on = 'Output_Areas' )
# Re-arranging the order of the column in joined dataset
join_data_2011 = join_data_2011.reindex(['LSOA','LAD','Output_Areas','Asian', 'Black', 'White','Others'], axis=1)
join_data_2011.head(1)
| LSOA | LAD | Output_Areas | Asian | Black | White | Others | |
|---|---|---|---|---|---|---|---|
| 0 | E01000001 | City of London | E00000001 | 12.0 | 0.0 | 350.0 | 26.0 |
# counting unique values in join_data
n = len(pd.unique(join_data_2011["LSOA"]))
print("No.of.unique values :",n)
# counting unique values in join_data
n = len(pd.unique(join_data_2011["LAD"]))
print("No.of.unique values :",n)
No.of.unique values : 34754 No.of.unique values : 349
# converting 'Weight' from float to int
join_data_2011['Asian'] = join_data_2011['Asian'].astype(int)
join_data_2011['Black'] = join_data_2011['Black'].astype(int)
join_data_2011['White'] = join_data_2011['White'].astype(int)
join_data_2011['Others'] = join_data_2011['Others'].astype(int)
#Saving joined dataset 2011:
join_data_2011.to_csv("Dataset/Processed_Data/processed_ethnic_data_2011.csv")
#Loading the joined census dataset
df_census_2011 = pd.read_csv("Dataset/Processed_Data/processed_ethnic_data_2011.csv", index_col=0)
df_census_2011.head(2)
| LSOA | LAD | Output_Areas | Asian | Black | White | Others | |
|---|---|---|---|---|---|---|---|
| 0 | E01000001 | City of London | E00000001 | 12 | 0 | 350 | 26 |
| 1 | E01000001 | City of London | E00000003 | 52 | 6 | 412 | 30 |
df_census_2011.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 181412 entries, 0 to 181411 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LSOA 181408 non-null object 1 LAD 181408 non-null object 2 Output_Areas 181408 non-null object 3 Asian 181412 non-null int64 4 Black 181412 non-null int64 5 White 181412 non-null int64 6 Others 181412 non-null int64 dtypes: int64(4), object(3) memory usage: 11.1+ MB
dfplot = df_census_2011.sort_values(by="Asian",ascending=False).head(20)
#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Asian')
#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='White')
#Adding legend and labels
plt.legend(loc='center right', prop={'size': 8})
plt.title('popluation by Ethnic Groups - Asians and White')
plt.xlabel('Poplulation - 2011')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)
#Displaying the plot
plt.show()
dfplot = df_census_2011.sort_values(by="Black",ascending=False).head(20)
#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Black')
#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='Others')
#Adding legend and labels
plt.legend(loc='center right', prop={'size': 8})
plt.title('popluation by Ethnic Groups - Black and others')
plt.xlabel('Poplulation - 2011')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)
#Displaying the plot
plt.show()
#Total popluation in the UK - # creatin this for visualization purpose
Total_UK_2011_asian = df_census_2011['Asian'].sum()
Total_UK_2011_black = df_census_2011['Black'].sum()
Total_UK_2011_white = df_census_2011['White'].sum()
Total_UK_2011_others = df_census_2011['Others'].sum()
# Selecting only Bradford from the joined dataset
data_bradford_2011 = df_census_2011[(df_census_2011.LAD == "Bradford")]
# creating new variables for visualization purpose
Total_Bradford_2011_asian = data_bradford_2011['Asian'].sum()
Total_Bradford_2011_black = data_bradford_2011['Black'].sum()
Total_Bradford_2011_white = data_bradford_2011['White'].sum()
Total_Bradford_2011_others = data_bradford_2011['Others'].sum()
#Load the OA census 2001 and LAD dataset
df_2001 = pd.read_csv("Dataset/Ethnic_census_data/2001_data/2001_census_data.csv", sep=';', header=None, encoding='latin-1')
df_lad_2001 = pd.read_csv("Dataset/Ethnic_census_data/2001_data/Output_Area_to_Lower_Layer_Super_Output_Area_to_Middle_Layer_Super_Output_Area_(December_2001)_Lookup_in_England_and_Wales.csv", encoding='latin-1')
#Set first Row as Header in 2001 census data
df_2001.columns = df_2001.iloc[0]
df_2001 = df_2001[1:]
# Taking only first columns because it has all the values for the ethnic group and dropping other columns
df_2001.drop(df_2001.iloc[:,1:], inplace=True, axis=1)
df_2001.head(1)
| date,"geography","geography code","Ethnic Group: All categories: Ethnic group | |
|---|---|
| 1 | 2001,"00EHNC0001","00EHNC0001",261,255,255,0,0... |
#Renaming the column name
df_2001.rename(columns={df_2001.columns[0]: 'Features'},inplace=True)
print(df_2001.columns)
Index(['Features'], dtype='object', name=0)
#Splitting the first column into multiple columns
df_2001 = df_2001.Features.str.split(',',expand=True)
df_2001.shape
(175434, 25)
df_2001.columns = ['date',
'geography',
'geography_code',
'All_categories',
'All_White',
'White_British',
'White_Irish',
'White_Other',
'Mixed',
'Mixed_White_Black_Caribbean',
'Mixed_White_Black_African',
'Mixed_White_Asian',
'Mixed_Other',
'Asian_Asian_British',
'Asian_Asian_British_Indian',
'Asian_Asian_British_Pakistani',
'Asian_Asian_British_Bangladeshi',
'Asian_Asian_British_Other',
'Black_Black_British',
'Black_Black_Caribbean',
'Black_Black_African',
'Black_other',
'Chinese_other',
'Chinese_other_Chinese',
'Chinese_other_other']
print(df_2001.columns)
Index(['date', 'geography', 'geography_code', 'All_categories', 'All_White',
'White_British', 'White_Irish', 'White_Other', 'Mixed',
'Mixed_White_Black_Caribbean', 'Mixed_White_Black_African',
'Mixed_White_Asian', 'Mixed_Other', 'Asian_Asian_British',
'Asian_Asian_British_Indian', 'Asian_Asian_British_Pakistani',
'Asian_Asian_British_Bangladeshi', 'Asian_Asian_British_Other',
'Black_Black_British', 'Black_Black_Caribbean', 'Black_Black_African',
'Black_other', 'Chinese_other', 'Chinese_other_Chinese',
'Chinese_other_other'],
dtype='object')
#Remove special characters from columns in OA data
df_2001['geography'] = df_2001['geography'].str.replace('\W', '', regex=True)
df_2001['geography_code'] = df_2001['geography_code'].str.replace('\W', '', regex=True)
#Changing the data type for OA dataset
df_2001['All_White'] = pd.to_numeric(df_2001['All_White'], errors='coerce')
df_2001['White_British'] = pd.to_numeric(df_2001['White_British'], errors='coerce')
df_2001['White_Irish'] = pd.to_numeric(df_2001['White_Irish'], errors='coerce')
df_2001['White_Other'] = pd.to_numeric(df_2001['White_Other'], errors='coerce')
df_2001['Mixed'] = pd.to_numeric(df_2001['Mixed'], errors='coerce')
df_2001['Mixed_White_Black_Caribbean'] = pd.to_numeric(df_2001['Mixed_White_Black_Caribbean'], errors='coerce')
df_2001['Mixed_White_Black_African'] = pd.to_numeric(df_2001['Mixed_White_Black_African'], errors='coerce')
df_2001['Mixed_White_Asian'] = pd.to_numeric(df_2001['Mixed_White_Asian'], errors='coerce')
df_2001['Mixed_Other'] = pd.to_numeric(df_2001['Mixed_Other'], errors='coerce')
df_2001['Asian_Asian_British'] = pd.to_numeric(df_2001['Asian_Asian_British'], errors='coerce')
df_2001['Asian_Asian_British_Indian'] = pd.to_numeric(df_2001['Asian_Asian_British_Indian'], errors='coerce')
df_2001['Asian_Asian_British_Pakistani'] = pd.to_numeric(df_2001['Asian_Asian_British_Pakistani'], errors='coerce')
df_2001['Asian_Asian_British_Bangladeshi'] = pd.to_numeric(df_2001['Asian_Asian_British_Bangladeshi'], errors='coerce')
df_2001['Asian_Asian_British_Other'] = pd.to_numeric(df_2001['Asian_Asian_British_Other'], errors='coerce')
df_2001['Black_Black_British'] = pd.to_numeric(df_2001['Black_Black_British'], errors='coerce')
df_2001['Black_Black_Caribbean'] = pd.to_numeric(df_2001['Black_Black_Caribbean'], errors='coerce')
df_2001['Black_Black_African'] = pd.to_numeric(df_2001['Black_Black_African'], errors='coerce')
df_2001['Black_other'] = pd.to_numeric(df_2001['Black_other'], errors='coerce')
df_2001['Chinese_other'] = pd.to_numeric(df_2001['Chinese_other'], errors='coerce')
df_2001['Chinese_other_Chinese'] = pd.to_numeric(df_2001['Chinese_other_Chinese'], errors='coerce')
df_2001['Chinese_other_other'] = pd.to_numeric(df_2001['Chinese_other_other'], errors='coerce')
df_2001.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 175434 entries, 1 to 175434 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 175434 non-null object 1 geography 175434 non-null object 2 geography_code 175434 non-null object 3 All_categories 175434 non-null object 4 All_White 175434 non-null int64 5 White_British 175434 non-null int64 6 White_Irish 175434 non-null int64 7 White_Other 175434 non-null int64 8 Mixed 175434 non-null int64 9 Mixed_White_Black_Caribbean 175434 non-null int64 10 Mixed_White_Black_African 175434 non-null int64 11 Mixed_White_Asian 175434 non-null int64 12 Mixed_Other 175434 non-null int64 13 Asian_Asian_British 175434 non-null int64 14 Asian_Asian_British_Indian 175434 non-null int64 15 Asian_Asian_British_Pakistani 175434 non-null int64 16 Asian_Asian_British_Bangladeshi 175434 non-null int64 17 Asian_Asian_British_Other 175434 non-null int64 18 Black_Black_British 175434 non-null int64 19 Black_Black_Caribbean 175434 non-null int64 20 Black_Black_African 175434 non-null int64 21 Black_other 175434 non-null int64 22 Chinese_other 175434 non-null int64 23 Chinese_other_Chinese 175434 non-null int64 24 Chinese_other_other 175434 non-null int64 dtypes: int64(21), object(4) memory usage: 33.5+ MB
# Select required columns to sum using DataFrame.iloc[]
df_2001['Asian'] = df_2001.iloc[:,13:18].sum(axis=1)
df_2001['Black'] = df_2001.iloc[:,18:22].sum(axis=1)
df_2001['White'] = df_2001.iloc[:,4:8].sum(axis=1)
df_2001['Others'] = df_2001.iloc[:,[8,9,10,11,12,22,23,24]].sum(axis=1)
df_2001.head(1)
| date | geography | geography_code | All_categories | All_White | White_British | White_Irish | White_Other | Mixed | Mixed_White_Black_Caribbean | ... | Black_Black_Caribbean | Black_Black_African | Black_other | Chinese_other | Chinese_other_Chinese | Chinese_other_other | Asian | Black | White | Others | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2001 | 00EHNC0001 | 00EHNC0001 | 261 | 255 | 255 | 0 | 0 | 3 | 0 | ... | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 510 | 12 |
1 rows × 29 columns
#Dropping other features and including only required features
df_2001.drop(df_2001.iloc[:,3:-4], inplace=True, axis=1)
#Counting unique values in 2021 census data
n = len(pd.unique(df_2001["geography"]))
print("No.of.unique values :",n)
#Counting unique values in LAD data
n = len(pd.unique(df_lad_2001["OA01CD"]))
print("No.of.unique values :",n)
No.of.unique values : 175434 No.of.unique values : 175434
#Renaming the column name in LAD data
df_lad_2001.rename(columns={df_lad_2001.columns[0]: 'geography'},inplace=True)
#Dropping other features and including only required features in LAD data
df_lad_2001.drop(df_lad_2001.iloc[:,3:], inplace=True, axis=1)
#Remove the last 3 digits in location in LAD data
df_lad_2001['LSOA01NM'] = df_lad_2001['LSOA01NM'].str[:-5]
df_lad_2001.head(1)
| geography | LSOA01CD | LSOA01NM | |
|---|---|---|---|
| 0 | 00AAFA0001 | E01000001 | City of London |
#Renaming the column name for LAD data for joining the data
df_lad_2001.rename(columns={'LSOA01CD':'LSOA','LSOA01NM':'LAD' },inplace=True)
#Checking if LAD and 2001 census data observations are same
df_lad_2001.shape
(175434, 3)
# Join OA and LAD data
join_data_2001 = pd.merge(left = df_2001, right = df_lad_2001, how = 'left', on = 'geography' )
join_data_2001.head(1)
| date | geography | geography_code | Asian | Black | White | Others | LSOA | LAD | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 00EHNC0001 | 00EHNC0001 | 0 | 0 | 510 | 12 | E01012305 | Darlington |
#Dropping other features and including only required features in OA dataset
join_data_2001.drop(join_data_2001.iloc[:,[0,2]], inplace=True, axis=1)
join_data_2001.head(2)
| geography | Asian | Black | White | Others | LSOA | LAD | |
|---|---|---|---|---|---|---|---|
| 0 | 00EHNC0001 | 0 | 0 | 510 | 12 | E01012305 | Darlington |
| 1 | 00EHNC0002 | 0 | 0 | 624 | 0 | E01012305 | Darlington |
#Re-arranging the order of the column in joined data
census_2001 = join_data_2001.reindex(['LSOA', 'LAD', 'geography', 'Asian', 'Black', 'White', 'Others'], axis=1)
census_2001.head(2)
| LSOA | LAD | geography | Asian | Black | White | Others | |
|---|---|---|---|---|---|---|---|
| 0 | E01012305 | Darlington | 00EHNC0001 | 0 | 0 | 510 | 12 |
| 1 | E01012305 | Darlington | 00EHNC0002 | 0 | 0 | 624 | 0 |
#Saving joined dataset 2011:
census_2001.to_csv("Dataset/Processed_Data/processed_ethnic_data_2001.csv")
#Loading the joined census dataset
df_census_2001 = pd.read_csv("Dataset/Processed_Data/processed_ethnic_data_2001.csv", index_col=0)
df_census_2001.head(2)
| LSOA | LAD | geography | Asian | Black | White | Others | |
|---|---|---|---|---|---|---|---|
| 0 | E01012305 | Darlington | 00EHNC0001 | 0 | 0 | 510 | 12 |
| 1 | E01012305 | Darlington | 00EHNC0002 | 0 | 0 | 624 | 0 |
dfplot = df_census_2001.sort_values(by="Asian",ascending=False).head(20)
#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Asian')
#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='White')
#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation by Ethnic Groups - Asians and White')
plt.xlabel('Poplulation - 2001')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)
#Displaying the plot
plt.show()
dfplot = df_census_2001.sort_values(by="Black",ascending=False).head(20)
#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Black')
#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='Others')
#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation by Ethnic Groups - Black and Others')
plt.xlabel('Poplulation - 2001')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)
#Displaying the plot
plt.show()
#Total popluation in the UK - creatin this for visualization purpose
Total_UK_2001_asian = int(df_census_2001['Asian'].sum())
Total_UK_2001_black = int(df_census_2001['Black'].sum())
Total_UK_2001_white = int(df_census_2001['White'].sum())
Total_UK_2001_others = int(df_census_2001['Others'].sum())
# Selecting only Bradford from the joined dataset
data_bradford_2001 = df_census_2001[(df_census_2001.LAD == "Bradford")]
# creatin this for visualization purpose
Total_Bradford_2001_asian = data_bradford_2001['Asian'].sum()
Total_Bradford_2001_black = data_bradford_2001['Black'].sum()
Total_Bradford_2001_white = data_bradford_2001['White'].sum()
Total_Bradford_2001_others = data_bradford_2001['Others'].sum()
# create an Empty DataFrame object
uk_population = pd.DataFrame(columns = ['year','Asian', 'Black', 'White', 'Others'])
print(uk_population)
Empty DataFrame Columns: [year, Asian, Black, White, Others] Index: []
# append rows to an empty DataFrame - uk_population
uk_population = uk_population.append({
'year' : '2021' ,
'Asian' : Total_UK_2021_asian,
'Black' : Total_UK_2021_black,
'White' : Total_UK_2021_white,
'Others' : Total_UK_2021_others},
ignore_index = True)
uk_population = uk_population.append({
'year' : '2011' ,
'Asian' : Total_UK_2011_asian,
'Black' : Total_UK_2011_black,
'White' : Total_UK_2011_white,
'Others' : Total_UK_2011_others},
ignore_index = True)
uk_population = uk_population.append({
'year' : '2001' ,
'Asian' : Total_UK_2001_asian,
'Black' : Total_UK_2001_black,
'White' : Total_UK_2001_white,
'Others' : Total_UK_2001_others},
ignore_index = True)
uk_population['Total'] = uk_population['Asian'] + uk_population ['Black'] + uk_population ['White'] + uk_population ['Others']
uk_population.head()
| year | Asian | Black | White | Others | Total | |
|---|---|---|---|---|---|---|
| 0 | 2021 | 5515892 | 2408980 | 48699515 | 2974062 | 59598449 |
| 1 | 2011 | 8427062 | 3729780 | 96418790 | 3576192 | 112151824 |
| 2 | 2001 | 4547668 | 2279020 | 95042004 | 2214618 | 104083310 |